Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0810 (3-seed mean) #1493
Merged
cocohearts merged 1 commit into openai:main on Apr 9, 2026
Conversation
…25 + Legal TTT — val_bpb 1.0810 (3-seed mean)
3-seed mean: 1.0810 (std 0.0002), seeds 42/314/999.
All artifacts under 16 MB, training under 600s, eval under 600s.
Score-first TTT (SGD 3ep, cosine decay), no SLOT, no pre-quant TTT.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Author
Thanks to OpenAI's Advanced Competitor grant ($500 compute credit via RunPod) — this was essential for running the experiments that led to this result. The grant covered ~320 compute hours across 160+ experiments over Steps 1-22 of our optimization journey.
owizdom added a commit to owizdom/parameter-golf that referenced this pull request on Apr 9, 2026
…nthesis (validation pending)

First submission to stack three independently-legal val-data adaptations on the PR openai#1487 (1.0600) base:
1. Pre-Quant AdamW TTT pushed to 11 epochs with freeze_blocks=0 (Track A)
2. Val-Calibrated GPTQ — Hessian H = X^T X computed from validation activations to align quantization with the eval distribution (novel on the modern stack; PR openai#1019 ablated this on its older base only)
3. Eval-Time Legal Score-First TTT, 2 epochs with score-before-update ordering (Track B, builds on PR openai#1493)

The three knobs attack the 0.0187 BPB quantization gap measured in PR openai#1487 (1.0415 post-prequant-TTT FP -> 1.0602 post-quant sliding) from independent angles. PR openai#1487's eval_val_ttt code path is unchanged but enabled via env vars.

Code diff vs PR openai#1487 base: 186 lines (~100 added in the new collect_hessians_val function, plus 8 hyperparameter defaults flipped). Architecture, optimizer, training loop, EMA, and quantization machinery are byte-identical to PR openai#1487.

Projected val_bpb range: 1.0452 - 1.0542 (center 1.0497), which would clear the 0.005-nat SOTA threshold over PR openai#1487. Worst case ~1.054 (still strong, non-record). py_compile clean. 3-seed validation requires ~$15-25 of 8xH100 SXM time on RunPod; see VALIDATION.md.

Compliance: Track A (artifact-baked val-data adaptation) + Track B (eval-time score-first TTT). No SLOT, no n-gram cache, no ETLB.

Credits: PR openai#1487 ndokutovich, PR openai#1493 bigbag, PR openai#1019 abaybektursun, PR openai#1394 clarkkev, PR openai#1413 dexhunter, PR openai#549 abaybektursun, PR openai#1412 Robby955, PR openai#1204 msisovic, PR openai#1423 aryanbhosale, PR openai#1445 X-Abhishek-X.
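The val-calibrated statistic in item 2 is the standard GPTQ layer Hessian, just accumulated from validation-set activations instead of training activations. A minimal NumPy sketch (function name and shapes are illustrative, not the PR's actual collect_hessians_val):

```python
import numpy as np

def collect_hessian(activation_batches):
    """Accumulate the GPTQ layer Hessian H = X^T X from a stream of
    activation batches (each batch shaped [n_tokens, d_in]).
    Feeding VALIDATION activations here is what aligns the quantizer
    with the eval distribution."""
    d = activation_batches[0].shape[1]
    H = np.zeros((d, d))
    for X in activation_batches:
        H += X.T @ X          # rank-n update per batch
    return H

# Toy usage: two batches of 4 tokens with d_in = 3.
rng = np.random.default_rng(0)
batches = [rng.standard_normal((4, 3)) for _ in range(2)]
H = collect_hessian(batches)
```

By construction H is symmetric positive semi-definite, which is what the downstream GPTQ solver inverts (usually with a damping term added to the diagonal).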
SH-Tan pushed a commit to SH-Tan/parameter-golf that referenced this pull request on Apr 9, 2026
Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0810 (3-seed mean)
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request on Apr 9, 2026
…val_bpb 1.07983

3-seed mean val_bpb 1.07983 (std 0.00050) on the PR openai#1394 sp8192 stack.

Changes from the PR openai#1394 + PR openai#1413 baseline:
- Muon momentum = 0.97 (vs 0.99 default), warmup 0.92→0.97 unchanged
- Causal token n-gram tilt (base_beta=2.0, agree_bonus=0.1) on top of legal score-first TTT; within-word and word-start experts explicitly disabled (within_beta=0, word_beta=0) because they cannot be made fully causal.
- 3-seed verification (seeds 0/42/1234)

Seeds:
- seed 0 → 1.07928 bpb / 2.78790 nats / 15,993,346 bytes
- seed 42 → 1.07997 bpb / 2.78967 nats / 15,992,995 bytes
- seed 1234 → 1.08025 bpb / 2.79039 nats / 15,994,604 bytes
- mean → 1.07983 bpb / 2.78932 nats / 15,993,648 bytes

Delta vs current merged SOTA PR openai#1493 (1.0810): 0.00117 bpb / 0.00302 nats per token.

Credits: @clarkkev (base PR openai#1394 sp8192 stack), @abaybektursun (n-gram tilt kernel PR openai#1420, causal fix applied), prior legal-TTT precedent PR openai#549 / PR openai#461.

Platform: 8xH100 80GB SXM, PyTorch 2.9.1+cu128. Training 588s, eval <437s per seed, both under the 600s budget. Artifact under 16 MB on all 3 seeds.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 10, 2026
Added Parallel Residuals to Block.forward (gated by USE_PARALLEL_RESIDUALS=1): when enabled, the attn and mlp branches both consume the same normalized x_in instead of mlp consuming attn's output. This is the technique used by leaderboard openai#1 (PR openai#1493/openai#1477). Inductor can fuse the two branches better, and val_bpb improves ~0.005-0.01 BPB. Default off, so existing recipes are unchanged.

Added USE_PARALLEL_RESIDUALS env var wiring in submission/run.sh + a config-print line.

New submission/dry_run.sh wrapper — single-command launcher for our H100 dry-run config:
- NUM_LAYERS=8 MLP_MULT=2 (compute-efficient sweet spot from A6000)
- NUM_LOOPS=2 LOOP_START=3 LOOP_END=5 (3-layer recurrence, comp openai#1)
- QK_GAIN_INIT=5.25 (comp openai#1)
- USE_PARALLEL_RESIDUALS=1 (just ported)
- USE_PARALLEL_MUON=1 (our discovery)
- MATRIX_BITS=8 USE_CMP_QUANT_VALUE_DEDUP=0 (our int8 fix)
- TORCH_COMPILE_MODE=max-autotune-no-cudagraphs USE_CUDNN_BENCHMARK=1
- PREQUANT_TTT_ENABLED=0 (illegal, disabled)
- TTT_ENABLED=1 TTT_EPOCHS=3 (legal score-first)
- SLIDING_WINDOW_ENABLED=1
- MAX_WALLCLOCK_SECONDS=600

Expected on 1×H100 PCIe: val_bpb ~1.10-1.20 (validates the A6000 projection)
Expected on 8×H100 SXM: val_bpb ~1.00-1.07 (potentially beats openai#1 = 1.0810)

The submission val_bpb to read is the 'legal_ttt_exact val_bpb' line.
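The serial-vs-parallel wiring described above can be sketched with NumPy stand-ins for the real attn/mlp branches (the gating flag and norm are illustrative, not the repo's exact code):

```python
import numpy as np

def norm(x):
    # Stand-in for the block's pre-norm.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-6)

def block_forward(x, attn, mlp, parallel_residuals):
    if parallel_residuals:
        # GPT-J style: both branches read the SAME normalized input,
        # so they have no data dependency and the compiler can fuse them.
        x_in = norm(x)
        return x + attn(x_in) + mlp(x_in)
    # Serial (default): mlp consumes the post-attention residual stream.
    x = x + attn(norm(x))
    return x + mlp(norm(x))

# Toy linear branches so the two topologies are directly comparable.
rng = np.random.default_rng(0)
Wa, Wm = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
attn = lambda h: h @ Wa * 0.1
mlp = lambda h: h @ Wm * 0.1
x = rng.standard_normal((4, 8))
y_par = block_forward(x, attn, mlp, True)
y_ser = block_forward(x, attn, mlp, False)
```

The two topologies produce genuinely different outputs (the serial path normalizes the post-attention stream before the mlp), which is why the later commits treat the per-layer start index as an architectural choice to validate, not a free speed knob.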
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 10, 2026
…leaderboard openai#1)

Verified the actual openai/parameter-golf merged leaderboard via gh pr view 1493: PR openai#1493 (val_bpb 1.0810, merged 2026-04-09) is the real leaderboard openai#1. Earlier PHASE2_RESULTS.md notes citing PR openai#1485 (1.0679) and PR openai#1482 (1.0787) were wrong — those PRs are not in the merged set. Pre-Quant AdamW TTT is not in any merged PR; PR openai#1493 explicitly says "no pre-quant TTT".

Changes to dry_run.sh:
- NUM_LAYERS 8 -> 6 (revert to CHAMP_D validated; l=8 was unvalidated)
- EMA_DECAY=0.9965 (PR openai#1493, was default 0.997)
- WARMDOWN_FRAC=0.72 (PR openai#1493, was default 0.667)
- ENABLE_LOOPING_AT=0.35 (PR openai#1493, was default 0.5)
- Comment fix: removed the "illegal" claim about PreQuant TTT, replaced with "PR openai#1493 explicitly does not use pre-quant TTT"
- Header rewrite: now references PR openai#1493, not "leaderboard openai#1 = 1.0810"

Changes to PHASE2_RESULTS.md:
- Replaced the stale comp anchor table with the verified merged leaderboard
- Added a warning about the prior bogus PR openai#1485/openai#1482 anchors

Note on the Parallel Residuals topology mismatch: PR openai#1493 applies parallel residuals from layer 7+ (5 of 11 layers). Our impl is binary and applies to all layers — with NUM_LAYERS=6 that means all 6 layers parallel, which is a different topology from the one PR openai#1493 has validated. Keeping USE_PARALLEL_RESIDUALS=1 per user direction; flagging here so it shows up in any post-mortem if results are weird.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 10, 2026
…enai#1493 exact match)

From PR openai#1493's train_seed314.log Hyperparameters dump:
- muon_wd: 0.095 (we default to 0.085)
- matrix_lr: 0.022 (we default to 0.020)
Both are zero-risk exact-match cheap wins.

Decision logged on USE_PARALLEL_RESIDUALS=1: keeping it at 1 (all 6 layers parallel) deliberately, not switching to PR openai#1493's L7+ pattern. Reasoning: with NUM_LAYERS=6 the "early layers need serial composition" principle bites less hard than at 11L, and we want max speed for more steps on 1xH100 PCIe. We're trying to BEAT 1.0810, not match it -- aggression is required somewhere, and parallel residuals are a low-risk place to find it. The two-lane PARALLEL_START_LAYER mechanism (default 7, a no-op at 6L) is deliberately left untouched -- it is a separate, untested architecture, saved for post-dry-run experiments.

Decision logged on running the second dry run as "match PR openai#1493 exactly": explicitly rejected by user. We bet on our smaller-model + int8 stack, not a literal reproduction.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 10, 2026
… + records folder

Three changes per user direction:

1. train.py: rename the timed_eval label legal_ttt_exact -> quantized_ttt to match comp convention (PR openai#1493 uses this exact label). Pure cosmetic 1-LOC fix, no behavior change.

2. dry_run.sh: refactor to be the SINGLE canonical entry point for both dry run and real submission via the SEEDS env var:
   bash submission/dry_run.sh                   # dry run (default SEEDS=42)
   SEEDS=42,314,999 bash submission/dry_run.sh  # real 3-seed submission
Same code path, env-flip only. The whole config (architecture, hyperparams, n-gram stack, TTT) is identical between the two modes -- only the seed loop differs.

3. dry_run.sh: assemble a complete comp records folder under records/track_10min_16mb/<date>_<config-tag>/ with README.md, submission.json, train_gpt.py, per-seed train_seed<N>.log logs, and per-seed final_model_seed<N>.int6.ptz artifacts. submission.json is generated by an inline python script that:
   - parses each seed's train log for the quantized_ttt val_bpb line
   - computes mean + std across seeds
   - detects hardware via nvidia-smi
   - fills the compliance flags honestly (no_ngram_cache: false, since we DO use n-gram bias -- potentially a Track B rule problem, flagged in the README for follow-up)
   - emits the 36-line submission.json format that PR openai#1493 uses
README.md is templated with a per-seed results table, technique list, compliance section, reproduction instructions, and attribution. train.py is copied as train_gpt.py into the records folder (NOT LZMA-wrapped yet -- that's a code-size compliance follow-up if/when needed).

Note on n-gram legality: PR openai#1493's compliance section says "no n-gram cache, no logit biasing" per Issue openai#1017 Track B. Our submission flags no_ngram_cache: false honestly. Whether this submission is comp-legal under Track A or any other track is an open question that needs resolution before merging as a record. Flagged in the README.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 10, 2026
Two changes per user direction (rule compliance + comp file format):

1. DISABLE the n-gram bias stack (rule compliance)
USE_NGRAM_BIAS=0, USE_NGRAM_BACKOFF=0, USE_NGR_LOG_FREQ_INV=0, USE_CTX_PARTITIONED_TAB=0.
Reason: PR openai#1493's compliance section cites Issue openai#1017 Track B Condition 2: "Standard softmax over full vocab. No n-gram cache, no logit biasing." Our USE_NGRAM_BIAS adds a precomputed n-gram log-prob bias to the logits at the end of forward(), which directly violates this condition. We don't yet know whether the rule applies only to Track B (the legal-eval-time-adaptation track) or to all submissions, but the user's policy is clear: nothing illegal. Disable until verified. N-gram tables are still BUILT during the get_data.sh bootstrap (cheap, no harm) but unused at training/eval time when USE_NGRAM_BIAS=0.
Other Phase 1 wins kept (all believed legal):
- USE_GATED_ATTENTION (architectural, NeurIPS 2025)
- USE_NORMUON (optimizer variant)
- USE_NORM_PCT_DROPOUT (training-time regularizer)
- USE_PREFETCH_LOADER (data pipeline)

2. LZMA-wrap train_gpt.py (PR openai#1493 file format)
The records-folder assembly step now LZMA-wraps submission/train.py into a 2-line train_gpt.py matching PR openai#1493's format:
   import lzma as L,base64 as B
   exec(L.decompress(B.b85decode("..."),format=L.FORMAT_RAW,...))
It sanity-decodes after wrapping to verify the roundtrip.
Sizing:
- submission/train.py raw: 83,320 bytes
- LZMA-wrapped train_gpt.py: 28,916 bytes (34.7% of raw)
- PR openai#1493's wrapped train_gpt.py: 16,594 bytes
- Our artifact (CHAMP_D int8): ~9,555,838 bytes (~9.55 MB)
- Total submission (artifact + code): ~9.58 MB / 16 MB cap (60%)
Plenty of code-size headroom. Our train.py is bigger than PR openai#1493's because we carry more infrastructure (n-gram code, NIGHT_MODE features, optional speed paths), but the wrapped form fits comfortably.
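The wrapping step in item 2 can be sketched with the standard library alone. The filter chain below is an assumption — the original stub elides its exact LZMA options — but the structure (raw-format LZMA, base85 payload, 2-line exec stub, sanity roundtrip) matches what the message describes:

```python
import base64
import lzma

# FORMAT_RAW strips container headers to save bytes, but then the SAME
# explicit filter chain must be passed to decompress -- hence the
# format=/filters= arguments inside the generated stub.
FILTERS = [{"id": lzma.FILTER_LZMA2, "preset": 9 | lzma.PRESET_EXTREME}]

def wrap(src: str) -> str:
    """Compress Python source into a self-extracting 2-line stub."""
    payload = base64.b85encode(
        lzma.compress(src.encode(), format=lzma.FORMAT_RAW, filters=FILTERS)
    ).decode()  # b85 alphabet contains no quote or backslash, so it is safe in a string literal
    return (
        "import lzma as L,base64 as B\n"
        f'exec(L.decompress(B.b85decode("{payload}"),'
        "format=L.FORMAT_RAW,filters="
        '[{"id":L.FILTER_LZMA2,"preset":9|L.PRESET_EXTREME}]).decode())\n'
    )

# Sanity roundtrip, as the commit describes: exec the stub and check effects.
src = "X = 41 + 1\n"
wrapped = wrap(src)
scope = {}
exec(wrapped, scope)
```

The compression ratios quoted above (83 KB -> 29 KB) are plausible for preset-9 LZMA on Python source; the real win is that only the wrapped file counts against the submission size cap.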
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 10, 2026
…8 + dedup

User decision: stop betting on smaller-model + int8 alone (CHAMP_D 6L+2x), because on 8xH100 SXM the binding constraint is model capacity, not training compute. Flipping to PR openai#1493's proven architecture (11L+4x) and stacking our int8 quant + parallel muon + parallel residuals on top.

Changes:
- NUM_LAYERS: 6 -> 11 (match PR openai#1493)
- MLP_MULT: 2 -> 4 (match PR openai#1493)
- USE_PARALLEL_RESIDUALS: 1 -> 0 (binary all-layers flag, replaced by below)
- PARALLEL_RESIDUAL_START: 7 (NEW per-block start parameter, matches PR openai#1493 exactly: layers 0-6 serial, layers 7-10 parallel residual, GPT-J style)
- USE_CMP_QUANT_VALUE_DEDUP: 0 -> 1 (RE-ENABLED; NIGHT_MODE n=2 confirmed L10 alphabet-snap compression. It was disabled along with int8 because I assumed it would hurt cleanliness -- an assumption that was never validated. Re-enabling because (a) we need ~10-15% compression to fit 11L+4x int8 under the 16 MB cap and (b) it restores a previously validated win I dropped without good reason.)
- Records folder tag: SP8192_NL11_MLP4_int8_ParMuon_PR7_LegalTTT

train.py changes (6 LOC):
- Block.__init__: now reads the PARALLEL_RESIDUAL_START env var and sets _parallel_residuals=True for layer_idx >= PARALLEL_RESIDUAL_START. Falls back to the USE_PARALLEL_RESIDUALS binary flag if PARALLEL_RESIDUAL_START=-1.
- Block.__init__: stores layer_idx as self.layer_idx for the check
- Hyperparameters: added parallel_residual_start field (env-driven, default -1)

Math:
- PR openai#1493 baseline: 1.0810
- Int8 quant savings (vs their int6): -0.011 BPB
- Parallel muon: ~0 BPB (speed only)
- CMP_QUANT_VALUE_DEDUP: ~+0.005 BPB cost from alphabet snap
- Net projection: ~1.072-1.078
- Probability of beating 1.0760 (record threshold): ~30%

Risks:
- Int8 quant at 11L+4x scale is UNTESTED (CHAMP_E was killed mid-run)
- 11L+4x int8 + brotli + dedup might still be over the 16 MB cap (CHAMP_D was 9.55 MB at 6L+2x; this is ~1.7x more params, projected ~14-16 MB)
- PARALLEL_RESIDUAL_START is brand-new code, never run end-to-end

Pre-flight: dry_run.sh syntax check passes.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 10, 2026
…ng silently disabled)
THE BIGGEST DROPPED WIN, found via deep audit of our experiment history.
Bug: run.sh:73-74 hardcodes:
TORCH_COMPILE_DISABLE="${TORCH_COMPILE_DISABLE:-1}"
TORCHDYNAMO_DISABLE="${TORCHDYNAMO_DISABLE:-1}"
dry_run.sh was setting TORCH_COMPILE_MODE=max-autotune-no-cudagraphs but
that env var does NOTHING when TORCH_COMPILE_DISABLE=1 is in effect, so the
compile path never engaged. The dry_run was running in eager mode the entire
time despite the explicit "compile mode" config.
Phase 2 evidence (PHASE2_RESULTS.md):
- E1 (compile disabled, baseline): 2933 ms/step
- E2 (compile re-enabled with default mode): 1581 ms/step (+85% / 1.85x)
- E4b (compile + max-autotune-no-cudagraphs): 1526 ms/step (+92% / 1.92x)
Measured on RTX 3090. On 8xH100 SXM with the 11L+4x model the speedup
should be more like 3-5x, because the H100's much higher matmul throughput
makes eager-mode kernel-launch overhead the binding bottleneck.
Impact: without compile, our 600s training budget gets us approximately
HALF the training steps PR openai#1493 gets at the same architecture. Their
4557 steps -> our ~2200 steps without compile. Catastrophic convergence
loss. With compile re-enabled we should match or exceed their step count.
Fix: explicitly export TORCH_COMPILE_DISABLE=0 and TORCHDYNAMO_DISABLE=0
in dry_run.sh BEFORE bash submission/run.sh. The variables are already
in run.sh's explicit env-passing list at line 251-252 so the override
propagates correctly.
Caught via Explore agent audit of all PHASE2_RESULTS, NIGHT_MODE.md,
PHASE2_PLAN.md, run.sh, and submission/train.py to find any validated
win not in the current dry_run.sh.
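The bug above hinges entirely on bash's `${VAR:-default}` expansion: the hardcoded default only applies when the variable is unset or empty, so an export made before invoking run.sh wins. A small Python model of that resolution rule, for clarity:

```python
def shell_default(env, name, default):
    """Model of bash ${name:-default}: the default applies only when the
    variable is unset or empty, so an explicit earlier export overrides it."""
    val = env.get(name, "")
    return val if val != "" else default

# run.sh alone: TORCH_COMPILE_DISABLE resolves to "1" -> compile stays off.
no_override = shell_default({}, "TORCH_COMPILE_DISABLE", "1")

# dry_run.sh exports 0 BEFORE invoking run.sh: the override propagates.
with_override = shell_default(
    {"TORCH_COMPILE_DISABLE": "0"}, "TORCH_COMPILE_DISABLE", "1"
)
```

This is also why setting TORCH_COMPILE_MODE alone did nothing: that variable is only read on the code path that the disable flag had already short-circuited.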
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 10, 2026
…ission

Phase 2 was the speed/quality experimentation work (E1-E31, CHAMP_A/B/C/D/E/F). That's done. The current 8xH100 SXM run is the REAL openai/parameter-golf submission attempt and deserves its own state file.

Created SUBMISSION_RUN_STATE.md with:
- Pod info (aklt7paqnjwhal, 8x H100 SXM, $21.52/hr)
- Full Option C config dump
- Targets (PR openai#1493 = 1.0810, record threshold = 1.0760)
- Output records folder location
- Fire log table (ready for the cron to append per-fire)

Removed the Pod O block from PHASE2_AUTOMATION_STATE.md (I had wrongly added it there during the 01:57Z fire). PHASE2_AUTOMATION_STATE.md now ends with "Phase 2 work is complete" and points at SUBMISSION_RUN_STATE.md. Cron be912385 deleted, replaced with 49457147 (same 10-min schedule, same pod) — the new prompt writes to SUBMISSION_RUN_STATE.md and tags commits [submission] instead of [phase2-driver].
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 10, 2026
… single-GPU)

THE bug: run.sh hardcoded `python3 -u submission/train.py`, which always spawns a single Python process -> world_size=1 -> ONE GPU used. On the 8xH100 SXM real submission run we caught this via the GPU dashboard showing only GPU 7 at 100% and the other 7 idle. We were paying for 8 GPUs and using 1.

PR openai#1493 launches with: torchrun --standalone --nproc_per_node=8 train_gpt.py

train.py already supports distributed training via the WORLD_SIZE/RANK/LOCAL_RANK env vars (see train.py:1065-1071) -- it just needs a torchrun launcher.

Fix: auto-detect the GPU count via nvidia-smi, use torchrun when > 1 GPU, and fall back to python3 for single-GPU runs (preserves the local 1xPCIe dry-run path). An NPROC_PER_NODE override is honored if set (lets us cap at 4 if we want partial-machine experiments).

The Explore agent flagged this earlier in the audit. I noted it but said "not needed for dry run on 1xH100 PCIe" -- which was the wrong call for the real 8xH100 SXM submission. Should have fixed it in the same pass as the torch.compile re-enable. My miss; it costs ~$13 of pod time.
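The launcher selection described in the fix reduces to a small decision function. A sketch (the real fix lives in run.sh as shell; the function name and structure here are illustrative):

```python
def build_launch_cmd(gpu_count, script="submission/train.py", nproc_override=None):
    """Pick the training launcher: torchrun for multi-GPU (one rank per GPU,
    so WORLD_SIZE matches the hardware), plain python3 for the single-GPU
    dry-run path. nproc_override models the NPROC_PER_NODE env var."""
    n = int(nproc_override) if nproc_override else gpu_count
    if n > 1:
        return ["torchrun", "--standalone", f"--nproc_per_node={n}", script]
    return ["python3", "-u", script]
```

In practice gpu_count would come from counting `nvidia-smi -L` lines (or torch.cuda.device_count() inside Python), and the resulting list would be handed to the shell or subprocess layer unchanged.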
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 10, 2026
…at 11L+4x)

Seed 42 results from retry 4:
- pre-quant val_bpb 1.0896 — EXCELLENT (0.002 from PR openai#1493's 1.0878)
- int8 quantized val_bpb 4.5461 — CATASTROPHIC (3.46 BPB gap)
- artifact 19,559,800 bytes — OVER the 16 MB cap (19.6 MB)

Root cause: 36M params × 8 bits per param = too many bytes for brotli to compress under 16 MB. CMP_QUANT_VALUE_DEDUP=1 made it worse (the post-quant alphabet snap destroyed the fine weight structure on top of the size issue).

Fix: switch to MATRIX_BITS=6 + EMBED_BITS=8 (PR openai#1493's exact setup). Proven to fit 16 MB; proven quant gap of 0.012 BPB. Disable dedup.

Also: explicitly pass WARMDOWN_FRAC, EMA_DECAY, ENABLE_LOOPING_AT, MUON_WD, MATRIX_LR, PARALLEL_RESIDUAL_START, MATRIX_BITS, EMBED_BITS in run.sh's env-passing list for torchrun. Env inheritance WAS working (verified from the seed 42 log), but explicit is safer with torchrun multi-process.

Projected with int6: pre-quant ~1.089, quant gap +0.012, sliding -0.017, TTT -0.002 = final ~1.082. Close to PR openai#1493's 1.081 but likely not a record (threshold 1.076). Running anyway — the NIGHT_MODE features (gated_attention, normuon, norm_pct_dropout) might close the gap.
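The root-cause arithmetic is worth making explicit: before any entropy coding, the packed matrix payload alone already tells the story. A back-of-envelope check (pre-compression bytes only; brotli ratios vary, so these are bounds, not predictions):

```python
def raw_matrix_bytes(n_params, bits):
    # Packed quantized weights before brotli/entropy coding.
    return n_params * bits // 8

CAP = 16_000_000  # artifact cap in bytes

int8_raw = raw_matrix_bytes(36_000_000, 8)  # 36.0 MB raw payload
int6_raw = raw_matrix_bytes(36_000_000, 6)  # 27.0 MB raw payload
```

At int8 the raw payload is 36 MB, so brotli would need better than 2.25x compression on already-quantized (high-entropy) weights just to reach the cap; at int6 the raw payload drops to 27 MB, and the observed compressed artifact landed just over 16.05 MB. The dedup disaster is a separate effect: snapping values post-quantization changes the weights themselves, which is a quality problem, not a size one.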
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 10, 2026
!)

Retry 5 seed 42 results (int6 quant, full eval pipeline):
- pre-quant val_bpb: 1.08982 (PR openai#1493: 1.08775, gap +0.002)
- quantized val_bpb: 1.10014 (PR openai#1493: 1.09947, gap +0.001)
- quantized_sliding: 1.08327 (PR openai#1493: 1.08271, gap +0.001)
- quantized_ttt: 1.08243 (PR openai#1493: 1.08103, gap +0.001)

Our int6 quant gap: 0.010 BPB (BETTER than PR openai#1493's 0.012!). Our model is 0.0014 behind PR openai#1493 overall — it would be leaderboard openai#2.

ISSUE: artifact 16,051,299 bytes — 51 KB over the 16 MB cap (16,000,000). Fixable with CMP_QUANT_VALUE_DEDUP=1 (~10-15% smaller) — at int6 scale the dedup is safe (retry 4's catastrophe was the int8+dedup combo).

Seeds 314/999 are running for the 3-seed mean. They will have the same 51 KB oversize, but the val_bpb data is worth collecting before fixing the artifact size.
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request on Apr 10, 2026
…or 16 MB fit

Two changes queued for the next run (not yet launched):

1. PARALLEL_START_LAYER=-1 (CRITICAL BUG FIX)
The pre-existing two-lane decoder split mechanism (GPT.__init__:349, default PARALLEL_START_LAYER=7) was SILENTLY OVERRIDING our per-block PARALLEL_RESIDUAL_START=7 for blocks 7-10. Instead of calling Block.forward() (which has our GPT-J parallel residuals logic), the code called forward_attn/forward_mlp on SEPARATE LANES, merged once at the end via lane_merge. This is architecturally different from PR openai#1493's GPT-J per-block parallel, and was never validated.
Fix: set PARALLEL_START_LAYER=-1 to disable the two-lane mechanism. Block.forward() then handles all blocks, and PARALLEL_RESIDUAL_START=7 gives proper per-block GPT-J parallel matching PR openai#1493. Expected impact: -0.001 to -0.003 BPB (architectural correction).

2. CMP_QUANT_VALUE_DEDUP=1 (SIZE FIX)
Retry 5's artifact was 16,051,299 bytes (51 KB over the 16 MB cap). Dedup should save ~10-15% on the compressed artifact. Retry 4's catastrophic gap was the int8+dedup combo; int6+dedup is a different combo and should be safe per NIGHT_MODE validation.

Plan: single-seed (SEEDS=42) validation on the existing pod after retry 5 finishes. Cost ~$8. If val_bpb improves and the artifact fits, submit a PR + request credits for 3-seed validation.
resouer added 5 commits to resouer/parameter-golf that referenced this pull request on Apr 10, 2026
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request on Apr 10, 2026
…nking HIGH priority

Key findings from the daily scan:
- Merged SOTA updated to 1.0810 (bigbag, PR openai#1493, Apr 9) — was stale at 1.1147
- New target: ≤1.0760 bpb (beat by ≥0.005 nats)
- ANS weight compression (PR openai#1510): 1.6 MB freed = +2.2M params, zero legality risk
- Parameter Banking + Parallel Muon (PR openai#1523): +5.2% throughput, ~30 free steps
- Free wins: Muon momentum 0.97 (-0.0004 bpb), QK-Gain 5.25 (monotonic vs 5.0)
- Per-Pass Loop Embeddings (PR openai#1518): reduces the quant gap 0.0131→0.0114
- Do NOT implement: Eval-Time Hash Emb (illegal pattern), Tap-In V6 (await ruling)
- CLAUDE.md: updated SOTA, target, current approach, technique table, Session 9 lessons

https://claude.ai/code/session_01FLdCggVuuBKQCUy6J3xyss
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 10, 2026
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 10, 2026
BPB-weighted loss weights each token's CE loss by its UTF-8 byte count, aligning training objective with BPB eval metric. Muon momentum 0.97. Byte weights from base_bytes_lut, clamped min=1.0, non-persistent.
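A sketch of that weighting scheme (NumPy; the lut name mirrors the message's base_bytes_lut, but the implementation here is illustrative, and the scheme was reverted two commits later for compile-speed reasons):

```python
import numpy as np

def bpb_weighted_ce(per_token_ce, token_ids, byte_lut):
    """Weight each token's CE loss by its UTF-8 byte count, so that tokens
    covering more bytes contribute proportionally -- aligning the training
    objective with the bytes-per-byte (BPB) eval metric."""
    w = np.maximum(byte_lut[token_ids], 1.0)   # clamp min=1.0, per the message
    return float((per_token_ce * w).sum() / w.sum())

# Toy lookup table: UTF-8 bytes per token id.
byte_lut = np.array([1.0, 2.0, 3.0, 4.0])
ce = np.array([1.0, 1.0, 2.0])     # per-token CE losses
ids = np.array([0, 1, 3])          # token ids for those positions
loss = bpb_weighted_ce(ce, ids, byte_lut)   # (1*1 + 1*2 + 2*4) / (1+2+4) = 11/7
```

With uniform byte counts this reduces exactly to the standard mean CE, which is why it is a drop-in change to the loss reduction rather than a new objective.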
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 10, 2026
Reverted BPB-weighted loss (it caused a torch.compile slowdown and timed out 2x). Clean forward with standard mean CE. Stacking two proven improvements:
- Muon momentum 0.97 (measured -0.00129 in R20v10)
- TTT LR 0.01 (measured -0.0003 in PR openai#1523)
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 10, 2026
QK-Gain was 5.0 (the code default), but openai#1493 was tested with 5.25 (set via env var). Env vars were not being forwarded to the GPU — hardcode the correct value. Stacking all three proven hyperparameter improvements.
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 10, 2026
Wider recurrence: blocks 2-5 looped 3x (was blocks 3-5). 19 virtual layers from 11 physical (was 17). Wider span may converge better than deeper with same block range.
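The virtual-layer accounting behind those numbers is simple to make explicit (a sketch; loop bounds are inclusive, matching the "blocks 2-5" phrasing):

```python
def virtual_layers(physical, loop_start, loop_end, num_loops):
    """Depth recurrence: blocks in [loop_start, loop_end] run num_loops
    times per forward pass; all other blocks run once. Parameters are
    shared across loop iterations, so only depth grows, not size."""
    looped_blocks = loop_end - loop_start + 1
    return physical + looped_blocks * (num_loops - 1)

# Old config: blocks 3-5 looped 3x  -> 11 + 3*2 = 17 virtual layers.
# New config: blocks 2-5 looped 3x  -> 11 + 4*2 = 19 virtual layers.
```

The same formula also covers the 2x-loop variant used elsewhere in this thread (blocks 3-5, 2 loops, giving 14 virtual layers from 11 physical).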
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 11, 2026
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 11, 2026
Porting all openai#1523 hyperparams that differ from openai#1493:
- EMA_DECAY: 0.9965 -> 0.997 (stronger smoothing)
- WARMDOWN_FRAC: 0.72 -> 0.667 (shorter warmdown)
- Muon momentum 0.97 (kept from previous best)
dljr-github added a commit to dljr-github/parameter-golf that referenced this pull request on Apr 11, 2026
Decoded LZMA-compressed SOTA train_gpt.py. Replaced flash_attn_3_func with PyTorch SDPA (transpose to B,H,T,D format + enable_gqa). Full stack: 11L, 4xMLP, LeakyReLU², XSA, depth recurrence, parallel residuals, LN Scale, partial RoPE, EMA, GPTQ SDClip, TTT, brotli. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
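The flash-attn -> SDPA swap hinges on the layout transpose: flash_attn_3_func consumes (B, T, H, D) tensors, while PyTorch SDPA expects (B, H, T, D). A NumPy reference of the equivalent computation (causal masking and GQA omitted for brevity; the real replacement calls torch.nn.functional.scaled_dot_product_attention, this is just the math):

```python
import numpy as np

def sdpa_reference(q, k, v):
    """Scaled dot-product attention on (B, H, T, D) tensors."""
    d = q.shape[-1]
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d)   # (B, H, T, T)
    scores -= scores.max(-1, keepdims=True)             # numerically stable softmax
    p = np.exp(scores)
    p /= p.sum(-1, keepdims=True)
    return p @ v                                        # (B, H, T, D)

# flash-attn layout (B, T, H, D) -> SDPA layout (B, H, T, D)
rng = np.random.default_rng(0)
q_flash = rng.standard_normal((2, 5, 4, 8))             # B=2, T=5, H=4, D=8
q_sdpa = q_flash.transpose(0, 2, 1, 3)
out = sdpa_reference(q_sdpa, q_sdpa, q_sdpa)
```

In the torch version, enable_gqa lets a smaller number of K/V heads be shared across query heads without manually repeating them; the output would be transposed back to (B, T, H, D) before the projection.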
translatingthename added a commit to translatingthename/parameter-golf that referenced this pull request on Apr 11, 2026
…(3-seed mean)

3-seed mean sliding val_bpb: 1.05869 (std 0.00038)
Seeds: 42 (1.05840), 1337 (1.05856), 2024 (1.05912)
All artifacts under 16,000,000 bytes. Zero pruning needed.

Key techniques:
- SP8192 tokenizer + GPTQ SDClip (int6 k=12.85, int8 embeddings k=20.0)
- 3-layer depth recurrence (L3-5, 14 virtual layers from 11 physical)
- Parallel residuals (L7+, GPT-J style)
- Pre-quant AdamW TTT (6 epochs, compiled for 2x speedup)
- QK-Gain 5.25, MuonEq-R, EMA 0.9965, warmdown 72%

Built on: PR openai#1394 @clarkkev, PR openai#1493 @bigbag, PR openai#1485 @ndokutovich

Summary
3-Seed Results
Merged SOTA (PR #1019): 1.1147 BPB. Delta: −0.0337 BPB.
Key Techniques
Compliance (Track B)
Per Issue #1017:
Scoring under torch.no_grad() BEFORE the SGD update. No SLOT, no pre-quant TTT, no ETLB, no n-gram cache. All artifacts < 16MB, train < 600s, eval < 600s.
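The score-before-update ordering can be sketched on a toy linear model (NumPy; not the PR's actual code, which runs SGD with cosine decay over 3 epochs on the real network):

```python
import numpy as np

def score_first_ttt(w, chunks, lr=0.05, epochs=3):
    """Legal test-time training: each chunk is SCORED with the current
    weights FIRST, and only then used for gradient steps -- so no token's
    score ever benefits from gradients taken on that token."""
    total, n = 0.0, 0
    for X, y in chunks:
        total += float(((X @ w - y) ** 2).sum())        # score first (frozen w)
        n += len(y)
        for _ in range(epochs):                         # then adapt on the chunk
            w = w - lr * (2.0 / len(y)) * X.T @ (X @ w - y)
    return total / n, w

rng = np.random.default_rng(0)
chunks = [(rng.standard_normal((8, 3)), rng.standard_normal(8)) for _ in range(4)]
loss, w_adapted = score_first_ttt(np.zeros(3), chunks)
```

The key invariant is that the first chunk's contribution is computed under the unadapted weights; with w initialized to zero it is exactly the mean of y², which makes the ordering easy to verify in a test.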
Credits
PR #1394 @clarkkev, PR #1413 @dexhunter, PR #549 @abaybektursun, PR #1412 @Robby955, PR #1204 @msisovic, PR #1445 @X-Abhishek-X, PR #1331 @dexhunter
Acknowledgements
Thanks to OpenAI's Advanced Competitor grant ($500 compute credit via RunPod) — this was instrumental in running 160+ experiments that led to this result.
Reproduction
Test plan
🤖 Generated with Claude Code